ABSTRACT

Summary: A MapReduce-based implementation called MR-MSPolygraph
for parallelizing peptide identification from mass
spectrometry data is presented. The underlying serial method, 
MSPolygraph, uses a novel hybrid approach to match an 
experimental spectrum against a combination of a protein sequence 
database and a spectral library. Our MapReduce implementation 
can run on any Hadoop cluster environment. Experimental results 
demonstrate that, relative to the serial version, MR-MSPolygraph 
reduces the time to solution from weeks to hours, for processing tens 
of thousands of experimental spectra. Speedup and other related 
performance studies are also reported on a 400-core Hadoop cluster 
using spectral datasets from environmental microbial communities as 
inputs. 

Availability: The source code and user documentation are
available at http://compbio.eecs.wsu.edu/MR-MSPolygraph.
Contact: ananth@eecs.wsu.edu; william.cannon@pnnl.gov 
Supplementary Information: Supplementary data are available at 
Bioinformatics online. 

Received on February 4, 2011; revised on July 18, 2011; accepted 
on September 1, 2011. 

1 INTRODUCTION 

Identifying the sequence composition of peptides is of fundamental 
importance to systems biology research. High-throughput proteomic 
technologies using mass spectrometry are capable of generating
millions of peptide mass spectra in a matter of days. One of the 
most effective ways to annotate these spectra is to compare the 
experimental spectra against a database of known protein sequences. 
The main idea here is to generate candidate peptide sequences 
from the genome of the organism under study and then to use 
models of peptide fragmentation to generate model spectra that 
can be compared against each experimental spectrum. However, 
as samples become richer in diversity (e.g. from environmental 
microbial communities), the number of candidate comparisons could 
increase by orders of magnitude (Supplementary Figure S1). An
increase in the number of candidates also increases the probability 
of finding high-scoring, random matches. It is therefore essential to 
implement a peptide identification method that is both accurate and 
scalable to large sizes of spectral collections and sequence databases. 
The prediction accuracy of peptide identification can be improved 
if experimental spectra are also compared against spectral libraries,
although this would only exacerbate the computational demands. 

Recently, Cannon et al. (2011) developed a novel hybrid statistical
method within the MSPolygraph framework, which combines the 
use of highly accurate spectral libraries, when available, along with 
on-the-fly generation of model spectra when spectral libraries are 
not available. This method demonstrated increases of 57-147% in 
the number of confidently identified peptides at controlled false 
discovery rates. This effort to enrich quality of prediction, however, 
comes at an increased computational cost. Although a parallel MPI
version of the code exists, most users do not have access to
large-scale parallel clusters. In contrast, open-source science cloud
installations and commercial vendors such as Amazon provide
access to MapReduce clusters on an on-demand basis.

In this article, we present a MapReduce implementation of 
MSPolygraph called MR-MSPolygraph. MapReduce (Dean and 
Ghemawat, 2008) is an emerging parallel paradigm for data 
intensive applications, and is becoming a de facto standard in 
cloud installations. One of the popular open-source implementations
for MapReduce is the Hadoop framework. MR-MSPolygraph uses 
MapReduce to efficiently distribute the matching of a large 
spectral collection on a Hadoop cluster. Previously, Halligan et al.
(2009) ported peptide identification tools that use the database
search approach onto the Amazon EC2 cloud environment. Our
work brings the statistics of the hybrid search method in
MSPolygraph to any cluster running the open-source Hadoop
environment.

2 METHODS 

MR-MSPolygraph is designed to achieve parallelism across the number of 
experimental spectra to be matched. The MapReduce framework requires 
developers to define two functions: mapper and reducer. In our case, since
each spectrum is processed independently of the others, we take
advantage of the inherent data parallelism by splitting the input experimental
spectra across map tasks. More specifically, the user inputs: (i) (queries)
a set of experimental spectra to be matched; (ii) (database) a FASTA file
containing known protein/peptide sequences; (iii) (spectral library) a set of 
peptides to be used as the spectral library (required only when the software 
is run in the 'hybrid' mode); and (iv) a file with quality control and output 
parameters. In addition, the user specifies a desired number of map tasks. The 
algorithm executes as follows: first, the queries are automatically partitioned 
into roughly equal sized chunks and supplied as input to each map task. 
The chunk size can be controlled by altering the number of map tasks, the
min.split.size parameter within Hadoop, or both. Each map task then runs a
modified implementation of the serial MSPolygraph code, which matches the
local batch of queries against the entire database, and also against the spectral
library if run in the hybrid mode. The map tasks then output, in one file per
task, a list of hits (sorted by statistical significance) for each of their queries.
The algorithm has a worst-case complexity of $O(q(n_1 + n_2)/p)$, where $q$ is
the number of experimental spectra, $p$ is the number of mappers and $n_1$ and
$n_2$ are the sizes of the database and spectral library, respectively. Since the
mappers' outputs cover different subsets of queries, the reducer functionality
is not used. However, if all the hits are to be reported in one output
file, this can be achieved using a single reducer. More usage details and
parameter descriptions can be found at the software web site.

© The Author(s) 2011. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0),
which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

[Figure 1: panel x-axes are (a) the number of input experimental spectra and (b) the number of map tasks.]

Fig. 1. Performance of MR-MSPolygraph: (a) Runtime as a function of the input number of spectra, keeping the number of map tasks fixed at 400; and
(b) speedup of the hybrid version relative to 100 map tasks, for varying input sizes. The number of map tasks is generally equal to the number of cores used,
although that could slightly vary as determined by Hadoop at runtime.
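
The map-only scheme described in this section can be illustrated with a minimal Python sketch. It is not the MR-MSPolygraph code: the names partition_queries, map_task and toy_score are hypothetical, the scoring function is only a placeholder for the MSPolygraph statistics, and Hadoop itself splits input files by byte ranges rather than in-memory lists.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[float, str]


def partition_queries(queries: List[str], num_map_tasks: int) -> List[List[str]]:
    """Split queries into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(queries), num_map_tasks)
    chunks, start = [], 0
    for i in range(num_map_tasks):
        size = base + (1 if i < extra else 0)
        chunks.append(queries[start:start + size])
        start += size
    return chunks


def map_task(chunk: List[str], database: List[str], library: List[str],
             score: Callable[[str, str], float],
             hybrid: bool = True) -> Dict[str, List[Hit]]:
    """Match every query in the chunk against all candidates.

    In hybrid mode the spectral library is searched in addition to the
    database. Hits for each query are sorted best-first.
    """
    candidates = database + (library if hybrid else [])
    results: Dict[str, List[Hit]] = {}
    for query in chunk:
        hits = [(score(query, cand), cand) for cand in candidates]
        hits.sort(key=lambda h: h[0], reverse=True)
        results[query] = hits
    return results


def toy_score(query: str, candidate: str) -> float:
    """Placeholder score (number of shared characters); an assumption for illustration."""
    return float(len(set(query) & set(candidate)))


if __name__ == "__main__":
    queries = [f"spectrum_{i}" for i in range(10)]
    chunks = partition_queries(queries, num_map_tasks=4)
    print("chunk sizes:", [len(c) for c in chunks])
    # No reducer: each map task's output is kept as its own result set.
    outputs = [map_task(c, ["PEPTIDE", "PROTEIN"], ["SPECTRUM"], toy_score)
               for c in chunks]
    print("result sets:", len(outputs))
```

Because the chunks cover disjoint subsets of queries, the per-task outputs never need to be merged, which is why the sketch has no reduce step.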

3 RESULTS 

MR-MSPolygraph was tested on the Magellan Hadoop cluster at the
National Energy Research Scientific Computing Center (NERSC).
The cluster has 75 nodes with a total of 600 cores dedicated
to Hadoop, where each node has two quad-core Intel Nehalem
2.67 GHz processors and 24 GB of DDR3 1333 MHz RAM. These
nodes run Cloudera's distribution for Hadoop 0.20.2+228. In our
experiments, we used the following datasets: (i) a collection of 
64 000 experimental spectra obtained from Synechococcus sp. PCC 
7002; (ii) a database containing 2.65 million microbial protein 
sequences downloaded from NCBI GenBank; and (iii) a spectral 
library containing a set of 1752 S. oneidensis MR-1 spectra.

Figure 1a shows the runtime of MR-MSPolygraph as a function
of input number of spectra (from 1K to 64K). Both modes of
the software, hybrid and database only, were tested. As expected, 
the runtime grows linearly with the input number of spectra. 
Furthermore, both the hybrid and database-only versions take almost 
identical times, indicating that the additional cost of matching 
against the spectral library is negligible for this input. This cost can be
expected to grow gradually with the size of the spectral
library used.
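
A back-of-the-envelope check with the worst-case complexity model from Section 2 is consistent with this observation. The Python sketch below assumes, as a simplification not stated in the text, that every candidate comparison costs about the same whether the candidate comes from the database or from the spectral library, and it treats the number of database sequences as the database size. The function names are hypothetical.

```python
def library_work_fraction(n_database: int, n_library: int) -> float:
    """Share of per-query comparisons due to the spectral library,
    under the model q * (n1 + n2) / p with equal cost per comparison."""
    return n_library / (n_database + n_library)


def comparisons_per_mapper(q: int, n_database: int, n_library: int, p: int) -> float:
    """Worst-case number of comparisons handled by one map task."""
    return q * (n_database + n_library) / p


if __name__ == "__main__":
    # Dataset sizes quoted in the text.
    n1, n2 = 2_650_000, 1_752
    share = library_work_fraction(n1, n2)
    print(f"library share of comparisons: {share:.4%}")
    print(f"comparisons per mapper (64K spectra, 400 map tasks): "
          f"{comparisons_per_mapper(64_000, n1, n2, 400):.3e}")
```

Under these assumptions the library accounts for less than 0.1% of the comparisons, so near-identical runtimes for the two modes are what the model would predict for this input.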

We also studied the performance by measuring the parallel 
runtime as a function of the number of map tasks used. 
Supplementary Table S1 shows the runtimes and Figure 1b shows
the corresponding speedup up to 400 map tasks, calculated relative to
the corresponding 100-mapper run. As can be observed, the runtime
roughly halves with each doubling of the number of map tasks and the
speedup becomes linear for larger inputs (e.g. 398× on 400 map
tasks for 64K spectra). This is to be expected because, for smaller inputs,
the overhead of loading the database and spectral library is likely
to dominate at larger processor counts. Perhaps the merits of Hadoop
parallelism become more evident upon comparing its performance 
against a serial implementation. For instance, to match the entire
collection of 64 000 spectra in hybrid mode, the serial implementation
of MSPolygraph is estimated to take >2000 CPU hours
on a state-of-the-art desktop computer, whereas our Hadoop
implementation finishes this task in ~6 h using 400 cores. We also
studied the effect of changing task granularity for each map task and
the results are summarized in the Supplementary Material.
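
The speedup figures quoted in this section can be related to runtimes with a short Python sketch. It assumes that relative speedup is scaled so that the 100-mapper baseline counts as 100×, which is an interpretation suggested by the reported 398× on 400 map tasks rather than a definition given in the text. The runtimes in the example are hypothetical and are not taken from Supplementary Table S1.

```python
def relative_speedup(t_base: float, t_p: float, p_base: int = 100) -> float:
    """Speedup of a run taking t_p, scaled so the baseline run counts as p_base."""
    return p_base * t_base / t_p


def parallel_efficiency(speedup: float, p: int) -> float:
    """Fraction of ideal (linear) speedup achieved on p map tasks."""
    return speedup / p


if __name__ == "__main__":
    # Hypothetical runtimes in hours, for illustration only.
    t100, t400 = 24.0, 6.0
    print("relative speedup:", relative_speedup(t100, t400))

    # Figures quoted in the text.
    print("efficiency at 398x on 400 map tasks:", parallel_efficiency(398, 400))
    # Lower bound on the reduction versus the serial estimate (>2000 h vs ~6 h).
    print("serial-to-parallel ratio:", round(2000 / 6))
```

With the quoted values, 398× on 400 map tasks corresponds to a parallel efficiency of about 99.5%, and the serial estimate exceeds the 400-core runtime by a factor of more than 300.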

ACKNOWLEDGEMENTS 

We thank Dr Ramakrishnan at NERSC for offering extensive help
with the setup of the Hadoop environment. We also thank the National Energy
Research Scientific Computing Center (NERSC) at Lawrence
Berkeley National Laboratory.

Funding: This work was supported by the National Science
Foundation (IIS 0916463 to A.K. and W.R.C.) and the Department
of Energy's Office of Biological and Environmental Research and
Office of Advanced Scientific Computing Research under contracts
(57271 and 54976 to W.R.C.).

Conflict of Interest: none declared.

REFERENCES

Cannon,W.R. et al. (2011) Large improvements in MS/MS based peptide identification
rates using a hybrid analysis. J. Proteome Res., 10, 2306-2317.
Dean,J. and Ghemawat,S. (2008) MapReduce: simplified data processing on large
clusters. Commun. ACM, 51, 107-113.
Halligan,B.D. et al. (2009) Low-cost, scalable proteomics data analysis using Amazon's
cloud computing services and open source search algorithms. J. Proteome Res., 8,
3148-3153.






